The BOOTStrep BioLexicon \citep{biolexicon} contains automatically produced verb subcategorization data.  Six million words of MEDLINE \textit{E.~coli}
abstracts and articles are parsed with the Enju deep
parser \citep{enju}, which has been adapted to the biomedical domain as
described by \citet{hara:06}.  No SCF inventory is assumed in advance; rather,
the set of grammatical relations for each verb instance is considered
as a potential SCF.  These are filtered at a relative frequency
threshold of 0.03, i.e., for any given verb, all SCFs with a relative frequency below 0.03 are discarded.  Filtering leads to an
inventory of 136 SCFs.  Further arguments and strongly-selected adjuncts are
chosen according to their log-likelihood with respect to the verb.  It
is important to note that the BioLexicon draws on a single subdomain of
biomedical literature.  
Moreover, the parsing model used in SCF discovery
is lexicalized, with a built-in notion of subcategorization, and is
tuned for biomedical data using a variety of external resources such
as GENIA \citep{Kim:EtAl:03}.  While this approach brings immediate
benefits in SCF acquisition accuracy within the same domain as the
training data, the model's reliance on manual annotation is costly, and
its built-in preconception of subcategorization may bias it against the
behavior of new subdomains.
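
To make the filtering step concrete, the following minimal Python sketch applies a 0.03 relative-frequency cutoff to per-verb SCF counts.  The frame labels, counts, and function name are purely illustrative, not drawn from the BioLexicon:

```python
from collections import Counter

def filter_scfs(scf_counts, threshold=0.03):
    """Keep only the SCFs of one verb whose relative frequency
    (count / total count for that verb) meets the threshold."""
    total = sum(scf_counts.values())
    return {scf: n for scf, n in scf_counts.items() if n / total >= threshold}

# Hypothetical frame counts observed for a single verb:
counts = Counter({"NP": 72, "NP_PP": 25, "NP_SCOMP": 2, "PP": 1})
print(filter_scfs(counts))  # the two rare frames (0.02, 0.01) are discarded
```

Applied independently to each verb, a cutoff of this kind trades recall of rare frames for robustness against parser noise.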

The BioLexicon is publicly available through ELRA (\url{http://catalog.elra.info}).
We used the BioLexicon exactly as provided, without additional training
or adaptation.  Our aim was to see how well a system trained with bio-specific tools, but on only a single subdomain, performs against a gold standard constructed from a wider variety of subdomains.
